1 Abstract

In the present paper we discuss a general approach to solving Boundary Value
Problems numerically in a parallel environment. The basic algorithm
consists of two steps: the local step, where all the P available processors
work in parallel, and the global step, where one processor solves a
tridiagonal linear system of order P. The main advantages of this approach
are twofold. First, the suggested approach is very flexible, especially in
the local step, and thus the algorithm can be used with any number of
processors and with any of the SIMD or MIMD machines. Second, the
communication complexity is very small, and thus the algorithm can be used
just as easily with shared memory machines. Several examples of using this
strategy are discussed.


Work partially funded by Space Act Agreement C-99066-G. 
2 Preliminaries

One of the main problems that appears in mathematical computing when
a new concept of computer hardware arrives is how to redesign the existing
numerical schemes efficiently, or how to invent new numerical schemes, so
as to take advantage of the special capabilities of the new machine. It is
now quite evident that the forthcoming computers of the present decade will
have substantial capabilities for parallelism. The present paper deals with
the reformulation of numerical schemes for boundary value problems (BVPs
hereafter), taking advantage of the availability of parallel computers on
the market and trying to use their features to the full extent. Numerical
methods for solving BVPs are well established. These methods were
designed mainly for serial computers and have been tested intensively
because of their importance: not only do BVPs appear in many physical
fields as the basic governing equations, but they also appear as a
sub-problem in solving Partial Differential Equations of the elliptic
and parabolic types [Lin (1986d)]. The parallel algorithms for BVPs that
will be presented here are based on the general strategy for creating high
order three point finite difference (FD hereafter) numerical schemes that
was presented before [Lin (1986a) and Lin (1986b)] and on the new
parallel technique for solving inherently serial problems [Lin (1986c)].
These two elements will be briefly discussed later. Several different
parallel engines are already in use by the scientific community and many
more are on the way. With such a wide variety of parallel computer systems
and architectures, the challenge is to develop parallel algorithms that are
efficient and portable from one machine to another. There are several
possibilities for doing this for a given algorithm [Gannon et al (1985)], but
expressing the algorithm in terms of modules with a high level of
granularity is optimal in some sense of portability. The algorithms that
will be presented hereafter are well suited for most of the machines,
keeping their features unchanged. However, in order to make the
presentation and the analysis easy, we’ll refer to a certain type of
parallel engine model.

The parallel engine model to which the present algorithms are easily
applied is of the SIMD (single instruction stream multiple data stream)
type [Flynn (1966)] as well as of the MIMD (multiple instruction multiple
data) type, with any number P of processor-elements (PEs), where each PE

• has its own ALU (arithmetic logic unit) to perform standard
calculations.

• may have its own local memory.

• can be masked by a disable signal, leaving it idle for some
period of time.

Although the categories of MIMD and SIMD are too crude, we can find 
generally two types of machines that are considered in the literature: 

• the shared memory machines (where a common memory is available 
for all the PEs). 

• the interconnection network machines (where all the PEs are
connected via a specific network).

Parallel machines of the second type appear to be more realistic [Dekel et 
al (1981), the INTEL Hypercube machine - Seitz (1985) and the IBM RP3 
machine - Pfister et al (1985)], mainly because the number of connections 
per PE is small (log P for the Cube network and 3 for the Shuffle-Exchange 
network), while for the shared memory type of machine this number is large. 
Therefore, although the algorithm can be applied for every machine, the 
interconnection network-like machine will be considered primarily hereafter. 

The main goal of this work is to formulate a parallel numerical scheme 
for Boundary Value Problems and to evaluate its performance for the above 
type of parallel engines. Basically the present method uses the parallel 
technique idea developed in Lin (1986c). Let us consider the following 
general BVP: 

\[ \Phi_{xx} = g(x, \Phi, \Phi_x) \tag{1} \]

where we define

\[ \Phi_x \equiv \frac{d\Phi}{dx} , \qquad \Phi_{xx} \equiv \frac{d^2\Phi}{dx^2} \tag{2} \]


g is a (nonlinear) functional of \(\Phi\) and of the coordinate x. Equation
(1) has to be solved over the domain \(\Omega = (L, R)\), L < R, where L is
its left boundary and R is its right boundary. For simplicity we will
consider the following Dirichlet boundary conditions for the BVP:

\[ \Phi(L) = \Phi_L ; \qquad \Phi(R) = \Phi_R \tag{3} \]

where \(\Phi_L\) and \(\Phi_R\) are known quantities. We will deal here only
with cases where equation (1) has a unique solution. This means that the
quantities:

\[ \frac{\partial g}{\partial \Phi_x} , \qquad \frac{\partial g}{\partial \Phi} \tag{4} \]

and g are continuous over \(\Omega\) and e > 0 [Lin (1986a)].

When trying to solve equation (1) numerically using traditional
approaches, we face two basic questions. The first is how to handle the
nonlinearity of g computationally, and the second is how to represent the
spatial derivatives of \(\Phi\) numerically. When parallel computing is
considered, some further questions arise. For example, what do we mean by
“solving equation (1)”? Does it mean that we should be able to evaluate
\(\Phi\) (within a certain accuracy) at each point \(x \in \Omega\), or does
it mean supplying values for \(\Phi\) at a finite number of points in
\(\Omega\)? It turns out that the design and the efficiency of parallel
algorithms for BVPs depend very much on this and other similar questions.

In the first step we would like to factor out the effects of the nonlinearity
from the computational scheme, since it is not a major issue of the present
parallel algorithm. It was shown that the nonlinearities of eq. (1) can be
treated reasonably simply by using some kind of iterative procedure,
where at every iteration stage a linear BVP is considered. Usually, a
Newton-like quasi-linearization is used for nonlinear BVPs [Lin (1986b)],
which results in the following linearized BVP of eq. (1):

\[ \Phi_{xx} = g = d(x) - b(x)\Phi_x - e(x)\Phi \tag{5} \]

which has to be solved at every iteration. From now on we’ll concentrate on
solving in parallel the boundary value problem associated with the above
equation.

3 The problem P1(g,X;C).

Before discussing the parallel algorithms, we introduce the following P1
problem, which turns out to be an important sub-problem that the final
algorithm solves. The P1 problem is related to eq. (5) as is discussed in
section 4. Let us first define the following variables:

• the real vector X: \(X^T = (x_1, x_2, x_3)\). If
\(L \le x_1 < x_2 < x_3 \le R\) then X is an acceptable vector. Thus an
acceptable X is a vector of three increasing elements.

• the real vector \(\phi\): \(\phi^T = (\Phi_1, \Phi_2, \Phi_3, 1)\), where
\(\Phi_i = \Phi(x_i)\), i = 1,2,3. Usually we’ll denote it by \(\phi(X)\).
Thus \(\phi\) is a vector of four elements, where the first three are the
values of \(\Phi\) at the three points given by the three elements of X.

• the real vector C: \(C^T = (c_1, 1, c_3, c_4)\) is a vector of four
scalars, three of which are unknown and are part of the solution of P1.

• the forward and the backward spacings around the point \(x_2\) are
\(h = x_3 - x_2\) and \(k = x_2 - x_1\), where the \(x_i\) are elements of an
acceptable vector X.

With these definitions, the following problem is defined: 

The Problem P1: Given some acceptable vector X, find a coefficient
vector C such that

\[ C^T \phi = c_1 \Phi(x_1) + \Phi(x_2) + c_3 \Phi(x_3) + c_4 = 0 \tag{6} \]

where the first three entries of \(\phi(X)\) are the discrete values of the
function \(\Phi\) that fulfills eq. (5) [or eq. (1)].

As we shall see later this problem is well defined and its solutions exist. 

The relevance of this problem to our parallel algorithm is that in all of its
versions each of the processors solves a problem of a type similar to
P1. The elements of the vector X in these implementations are three
successive elements of the set of key points that will be discussed later in
the parallel algorithm. One of the main features of P1 that is needed later
in the paper is the following theorem:

THEOREM: 1 The solution C to the problem P1(g,X;C) exists and is unique.
The following lemmas are needed for the proof of this theorem:

LEMMA: 1 The n-th derivative of \(\Phi\) that fulfills eq. (5) can be
expressed generally by:

\[ \Phi^{(n)} = \alpha_n(x)\Phi' + \beta_n(x)\Phi + \gamma_n(x) , \qquad n = 2, 3, \ldots \tag{7} \]

where the coefficients of this equation are governed by the following recurrence
system:

\[ \begin{cases}
\alpha_{n+1} = \alpha_n' - b\,\alpha_n + \beta_n ; & \alpha_2 = -b(x) \\
\beta_{n+1} = \beta_n' - e\,\alpha_n ; & \beta_2 = -e(x) \\
\gamma_{n+1} = \gamma_n' + d\,\alpha_n ; & \gamma_2 = d(x)
\end{cases} \tag{8} \]

where the primes are the derivatives with respect to x.
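
The induction step behind the recurrence can be written out explicitly. The
following is a short worked derivation, assuming the linearized form
\(\Phi'' = d - b\Phi' - e\Phi\) of eq. (5) and coefficients smooth enough to
be differentiated:

\[
\begin{aligned}
\Phi^{(n+1)} &= \frac{d}{dx}\left[ \alpha_n \Phi' + \beta_n \Phi + \gamma_n \right]
 = \alpha_n' \Phi' + \alpha_n \Phi'' + \beta_n' \Phi + \beta_n \Phi' + \gamma_n' \\
&= (\alpha_n' - b\,\alpha_n + \beta_n)\,\Phi' + (\beta_n' - e\,\alpha_n)\,\Phi + (\gamma_n' + d\,\alpha_n)
\end{aligned}
\]

Matching the three brackets with the coefficients of \(\Phi'\), \(\Phi\) and
the free term at level n + 1 reproduces the three recurrences, and the base
case n = 2 is eq. (5) itself, read as \(\Phi'' = (-b)\Phi' + (-e)\Phi + d\).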


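LEMMA: 2 For a given P1(g,X;C) problem and a given integer t, the
following are approximations for \(\Phi_1\) and \(\Phi_3\) of the order of
(t + 1):

\[ \begin{cases}
\left| [1 + Q_t(\beta; h)]\Phi_2 + [Q_t(\alpha; h) + h]\Phi_2' + Q_t(\gamma; h) - \Phi_3 \right| \approx O(h^{t+1}) \\
\left| [1 + Q_t(\beta; -k)]\Phi_2 + [Q_t(\alpha; -k) - k]\Phi_2' + Q_t(\gamma; -k) - \Phi_1 \right| \approx O(k^{t+1})
\end{cases} \tag{9} \]

where \(\alpha\), \(\beta\) and \(\gamma\) are the sets of functions
\(\alpha = \{\alpha_i\}\), \(\beta = \{\beta_i\}\), \(\gamma = \{\gamma_i\}\),
and for any set of functions \(f = \{f_i(x)\}\), \(i \ge 2\), the functional
Q is defined as:

\[ Q_t(f; z) = \sum_{i=2}^{t} \frac{z^i}{i!} f_i(x) \tag{10} \]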


The proof of the first lemma is by induction, using the definitions of the
coefficients in eq. (8). Algebraically, the proof is simple and it assumes
that these coefficients are smooth enough. The second lemma can be proven in
a way similar to the proof of the correctness of the high order accurate
numerical schemes for BVPs [Lin (1986a)]: first we expand \(\Phi_1\) and
\(\Phi_3\) into Taylor series around \(\Phi_2\) (that is, about the point
\(x_2\)). Then the high order derivatives of \(\Phi\) that appear in these
expansions can be replaced by a linear combination of \(\Phi'\) and \(\Phi\)
using equation (7) in Lemma 1, and Lemma 2 follows.

Now, by eliminating the derivative \(\Phi_2'\) from equations (9), it can be
shown that \(\Phi_1\), \(\Phi_2\) and \(\Phi_3\) are linearly related, where
the coefficient of \(\Phi_2\) is always different from zero. This shows that
the relation between all the components of \(\phi\) is unique and thus
Theorem 1 is proved.
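
The elimination described above can be tried numerically. The following is a
minimal Python sketch, not the paper's implementation: it assumes constant
coefficients b, e and d (so that every primed term in the recurrence of
Lemma 1 vanishes), and the function name `p1_coefficients` and its arguments
are hypothetical.

```python
from math import factorial, sin, cos

def p1_coefficients(x1, x2, x3, b, e, d, t=16):
    """Return (c1, 1, c3, c4) for Phi'' = d - b*Phi' - e*Phi with constant b, e, d."""
    h, k = x3 - x2, x2 - x1
    # Recurrence of Lemma 1 with all primes equal to zero (constant coefficients),
    # seeded at n = 2 by the differential equation itself.
    alpha, beta, gamma = {2: -b}, {2: -e}, {2: d}
    for n in range(2, t):
        alpha[n + 1] = -b * alpha[n] + beta[n]
        beta[n + 1] = -e * alpha[n]
        gamma[n + 1] = d * alpha[n]

    def Q(f, z):
        # Functional Q_t(f; z) = sum_{i=2}^{t} z^i / i! * f_i
        return sum(z**i / factorial(i) * f[i] for i in range(2, t + 1))

    # Lemma 2: Phi_3 = A_h*Phi_2 + B_h*Phi_2' + G_h and Phi_1 = A_k*Phi_2 + B_k*Phi_2' + G_k
    A_h, B_h, G_h = 1 + Q(beta, h), Q(alpha, h) + h, Q(gamma, h)
    A_k, B_k, G_k = 1 + Q(beta, -k), Q(alpha, -k) - k, Q(gamma, -k)

    # Eliminate Phi_2' and scale so that the coefficient of Phi_2 equals 1.
    mid = A_h * B_k - A_k * B_h
    return (B_h / mid, 1.0, -B_k / mid, (B_k * G_h - B_h * G_k) / mid)
```

For \(\Phi'' = 2 - \Phi\) (b = 0, e = 1, d = 2) both 2 + sin x and 2 + cos x
are solutions, so the returned relation should hold for either of them at
any three increasing points.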

There are several ways of solving numerically the P1 problem for eq. (5),
which, for a given BVP, is to find the dependency of the values of
\(\Phi(x)\), for some specific \(x \in \Omega\), on the boundary conditions.
We shall elaborate on one possibility, although other possibilities can be
considered as well, as long as they do not contradict the requirements that
appear in the parallel algorithm scheme (section 4). Say that we spread n
grid points over \(\Omega\), where the index of the point \(x_1\) is 1, the
index of the point \(x_3\) is n, and k is the index of the point \(x_2\),
1 < k < n. Given a desired accuracy order t, an FD approximation for eq. (5)
that is spread over 3 grid points can be generated [Lin (1986a)] by applying
equations (9) at the i-th grid point:


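\[
\begin{aligned}
& [Q_t(\alpha; h_i) + h_i]\,\Phi_{i-1} \\
& + \{ [1 + Q_t(\beta; h_i)][Q_t(\alpha; -h_{i-1}) - h_{i-1}] - [1 + Q_t(\beta; -h_{i-1})][Q_t(\alpha; h_i) + h_i] \}\,\Phi_i \\
& - [Q_t(\alpha; -h_{i-1}) - h_{i-1}]\,\Phi_{i+1} \\
& = Q_t(\gamma; -h_{i-1})[Q_t(\alpha; h_i) + h_i] - Q_t(\gamma; h_i)[Q_t(\alpha; -h_{i-1}) - h_{i-1}]
\end{aligned}
\]

where

\[ h_i = x_{i+1} - x_i \]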

Doing this for all the internal grid points in \(\Omega\), the following
tridiagonal-like system is obtained:

\[ l_j \phi_{j-1} + b_j \phi_j + r_j \phi_{j+1} = d_j + e_j \phi_1 + f_j \phi_n , \qquad j = 2, \ldots, n-1 \tag{11} \]

with \(l_2 = r_{n-1} = 0\), and the nature of the FD equations is that
\(e_j = 0\) for \(j \ge 3\) and \(f_j = 0\) for \(j \le n - 2\). With this
approach the solution vector C to the problem P1 is obtained in two steps:

step 1: the contributions of the lower diagonal entries, \(l_j\), are
eliminated from the 2nd equation (j = 3) up to the equation for j = k.

step 2: the contributions of the upper diagonal entries, \(r_j\), are
eliminated from the equation for j = n - 2 backwards down to the equation
for j = k.

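Following is a PASCAL program that realizes this algorithm (the arrays a, b
and c hold the lower diagonal, main diagonal and upper diagonal entries of
system (11)):

PROCEDURE Solve_P1 ( VAR a, b, c, d, e, f : VECTOR ;
                     VAR Coef             : SOLUTION ;
                         n, k             : INDEX ) ;
{ Pascal is case-insensitive, so the result vector is named Coef to keep
  it distinct from the upper diagonal array c. }
VAR j : INDEX ;
    m : REAL ;
BEGIN
  { Step 1: eliminate the lower diagonal a[j] for j = 3..k }
  FOR j := 3 TO k DO
  BEGIN
    m    := a[j]/b[j-1] ;
    b[j] := b[j] - m*c[j-1] ;
    d[j] := d[j] - m*d[j-1] ;
    e[j] := - m*e[j-1] ;         { e[j] = 0 on entry for j >= 3 }
  END ;

  { Step 2: eliminate the upper diagonal c[j] for j = n-2 down to k, }
  { using row j+1 as the pivot row }
  FOR j := n-2 DOWNTO k DO
  BEGIN
    m    := c[j]/b[j+1] ;
    b[j] := b[j] - m*a[j+1] ;
    d[j] := d[j] - m*d[j+1] ;
    f[j] := - m*f[j+1] ;         { f[j] = 0 on entry for j <= n-2 }
  END ;

  { Row k now reads b[k]*phi[k] = d[k] + e[k]*phi[1] + f[k]*phi[n] }
  Coef[1] := -e[k]/b[k] ;
  Coef[2] := 1 ;
  Coef[3] := -f[k]/b[k] ;
  Coef[4] := -d[k]/b[k] ;
END ;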
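
For readers who want to experiment with the two-sweep elimination, here is a
hedged Python transcription of the procedure. It is a sketch under stated
assumptions rather than the author's code: lists are 1-based with index 0
unused, the backward sweep uses row j + 1 as the pivot row, and the helper
`laplace_rows` (a hypothetical name) builds the standard second-order rows
for \(\Phi'' = 0\) only to provide a case with a known answer.

```python
def solve_p1(a, b, c, d, e, f, n, k):
    """Rows j = 2..n-1 (1-based lists, index 0 unused) hold
    a[j]*phi[j-1] + b[j]*phi[j] + c[j]*phi[j+1] = d[j] + e[j]*phi[1] + f[j]*phi[n].
    Returns (c1, 1, c3, c4) with c1*phi[1] + phi[k] + c3*phi[n] + c4 = 0."""
    a, b, c, d, e, f = (list(v) for v in (a, b, c, d, e, f))  # work on copies
    for j in range(3, k + 1):            # forward sweep removes a[j]
        m = a[j] / b[j - 1]
        b[j] -= m * c[j - 1]
        d[j] -= m * d[j - 1]
        e[j] -= m * e[j - 1]
    for j in range(n - 2, k - 1, -1):    # backward sweep removes c[j]
        m = c[j] / b[j + 1]
        b[j] -= m * a[j + 1]
        d[j] -= m * d[j + 1]
        f[j] -= m * f[j + 1]
    return (-e[k] / b[k], 1.0, -f[k] / b[k], -d[k] / b[k])


def laplace_rows(n):
    """Second-order FD rows for Phi'' = 0 on n equally spaced points (illustration only)."""
    zeros = [0.0] * (n + 1)
    a, b, c, d, e, f = (zeros[:] for _ in range(6))
    for j in range(2, n):
        a[j], b[j], c[j] = 1.0, -2.0, 1.0
    a[2], e[2] = 0.0, -1.0               # phi[1] moved to the right-hand side
    c[n - 1], f[n - 1] = 0.0, -1.0       # phi[n] moved to the right-hand side
    return a, b, c, d, e, f
```

Because the solution of \(\Phi'' = 0\) is linear, the exact relation at grid
index k is \(\phi_k = [(n-k)\phi_1 + (k-1)\phi_n]/(n-1)\), which gives a
simple check of the returned coefficients.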

For the simple case of k = n - 1 we get the known folding algorithm
[Wang (1981)]. In order to evaluate the performance of this program the
following definitions and notations are needed:

“AS”: The time needed to execute ADD or SUBTRACT on one processor.

“M”: The time needed to execute MULTIPLICATION on one processor.

“D”: The time needed to execute DIVISION on one processor.

“CS”: The time needed to execute CHANGE SIGN on one processor.

Now, the complexity of the above program is given by the following lemma:

LEMMA: 3 The time T needed for one processor to finish the execution of
the program Solve_P1 is independent of the location k of the point \(x_2\)
inside the set of n points:

\[ T = En \tag{12} \]

where

\[ E = 3M + 2AS + D + CS \tag{13} \]


The way the grid points are spread over \(\Omega\) depends on the error
requirements and on some prior knowledge of the solution’s behavior.
However, as the order of accuracy is raised, this sensitivity is reduced
[Lin (1986b)] for reasonable (polynomial) solutions. For very steep
(exponential) solutions, an adaptive version [Lin (1990)] of this algorithm
has to be considered. It should be noted that in order to solve P1 one does
not need to solve for \(\Phi(x)\) over \(\Omega\). Thus, for example, the use
of a shooting method to solve this problem may have some disadvantage in
terms of computational complexity, since with this technique it is also
necessary to solve for the function itself over \(\Omega\). In order to
solve for C, we need at least three independent shootings and thus its
complexity is \(T = \tilde{E} n\), where n is the number of steps in
\(\Omega\), and usually \(\tilde{E} > 2E\) for most of the second order
accurate numerical schemes for initial value problems. Methods that use
superposition and orthonormalization techniques [Scott et al (1977)] are
favorable in this case since they may end up solving for C without solving
for \(\Phi(x)\); however, they will do it by solving a full linear system
(and not a simple tridiagonal system as here). Hereafter we’ll present and
discuss two basic parallel algorithms to solve the BVP in (5).
4 The parallel Algorithm PARA1.

The first parallel algorithm to solve equation (5) numerically uses a
strategy similar to the one developed in [Lin (1986c)]; this strategy
consists of the following three major steps:

Step 1: Choose a set W of P + 2 internal discrete points in \(\Omega\):
\(W = \{x_0, x_1, \ldots, x_P, x_{P+1}\}\), with the understanding that
\(x_0 = L\), \(x_{P+1} = R\) and \(x_i < x_{i+1}\) for \(0 \le i \le P\).
These points are the key points which split the domain into P subdomains.
The choice of W is usually based on some estimate of the subregions in
\(\Omega\) over which the solution is expected to have a relatively large
error, as well as on the upper bound requirements on the errors and on the
order of accuracy, as will be discussed later in the paper. We define a set
of P acceptable vectors \(Y = \{Y_i\}_{i=1}^{P}\) as follows:

\[ Y_i = (x_{i-1}, x_i, x_{i+1})^T , \qquad i = 1, 2, \ldots, P \tag{14} \]

Step 2: Solve in a parallel manner P problems of the type
\(P1(g, Y_i; C_i)\), i = 1, 2, ..., P, where the i-th processor solves the
i-th problem independently of the other processors. Thus, each processor i
suggests a relationship of the type:

\[ C_i^T \phi_i = c_1^i \Phi(x_{i-1}) + \Phi(x_i) + c_3^i \Phi(x_{i+1}) + c_4^i = 0 \tag{15} \]

The important issue here is to find the set of vectors
\(C = \{C_i\}_{i=1}^{P}\) such that the accuracy requirements on C are
fulfilled and that all the processors finish this task at the same time.
Later in the paper we’ll discuss the sensitivity of this demand in cases
where not all the PEs finish their task at the same time, and the tradeoffs
between this demand and the accuracy requirements. This step is sometimes
called the local step, as it is local to the subdomain defined by \(Y_i\)
and local to the processor i.

Step 3: Solve the following tridiagonal linear system for the set of vectors
\(\phi = \{\phi_i\}_{i=1}^{P}\):

\[ C_i^T \phi_i = 0 , \qquad i = 1, 2, \ldots, P \tag{16} \]

using one of the processors. This step is called the global step since
the results of all the processors interact here to produce the final
solution.
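
The three steps can be illustrated end to end with a small Python sketch. It
is only an illustration under stated assumptions, not the implementation
used in the paper: the model equation is \(\Phi'' = d_0 - e_0\Phi\) with
constant \(e_0\) and \(d_0\), the local scheme is the ordinary second-order
FD stencil on a uniform grid, a thread pool stands in for the P processors,
and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sinh

def solve_p1(a, b, c, d, e, f, n, k):
    """Two-sweep elimination on 1-based rows 2..n-1; returns (c1, 1, c3, c4)."""
    for j in range(3, k + 1):                    # remove lower diagonal
        m = a[j] / b[j - 1]
        b[j] -= m * c[j - 1]; d[j] -= m * d[j - 1]; e[j] -= m * e[j - 1]
    for j in range(n - 2, k - 1, -1):            # remove upper diagonal
        m = c[j] / b[j + 1]
        b[j] -= m * a[j + 1]; d[j] -= m * d[j + 1]; f[j] -= m * f[j + 1]
    return (-e[k] / b[k], 1.0, -f[k] / b[k], -d[k] / b[k])

def local_step(args):
    """One processor: FD rows for Phi'' = d0 - e0*Phi on n uniform points, key point k."""
    h, n, k, e0, d0 = args
    zeros = [0.0] * (n + 1)
    a, b, c, d, e, f = (zeros[:] for _ in range(6))
    for j in range(2, n):
        a[j], b[j], c[j], d[j] = 1.0, -2.0 + e0 * h * h, 1.0, d0 * h * h
    a[2], e[2] = 0.0, -1.0                       # left end value goes to the right-hand side
    c[n - 1], f[n - 1] = 0.0, -1.0               # right end value goes to the right-hand side
    return solve_p1(a, b, c, d, e, f, n, k)

def para1(L, R, phi_L, phi_R, P, s, e0, d0):
    """Return the P interior key points and the values of Phi found there."""
    h = (R - L) / ((P + 1) * s)                  # fine grid; key points every s cells
    n, k = 2 * s + 1, s + 1                      # grid over [x_{i-1}, x_{i+1}], key point in the middle
    with ThreadPoolExecutor() as pool:           # local step: P independent P1 problems
        C = list(pool.map(local_step, [(h, n, k, e0, d0)] * P))
    # Global step: c1*Phi_{i-1} + Phi_i + c3*Phi_{i+1} + c4 = 0, solved by the Thomas algorithm.
    lo, up = [ci[0] for ci in C], [ci[2] for ci in C]
    di, rhs = [1.0] * P, [-ci[3] for ci in C]
    rhs[0] -= lo[0] * phi_L
    rhs[-1] -= up[-1] * phi_R
    for i in range(1, P):
        w = lo[i] / di[i - 1]
        di[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    phi = [0.0] * P
    phi[-1] = rhs[-1] / di[-1]
    for i in range(P - 2, -1, -1):
        phi[i] = (rhs[i] - up[i] * phi[i + 1]) / di[i]
    return [L + (i + 1) * s * h for i in range(P)], phi
```

In this sketch every processor receives the same number of grid points, so
the local tasks have equal cost, and only the four coefficients per
processor have to be communicated before the global solve.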

A possible way to execute step 2 is by letting each processor i spread an FD
grid over \(Y_i\) (i.e. between the grid points \(x_{i-1}\) and
\(x_{i+1}\)) and execute the program Solve_P1. The number of grid points and
the way they are chosen in \(Y_i\) is determined by the requirements on the
accuracy of the coefficient vector \(C_i\). Its accuracy is related to the
accuracy of the solution \(\Phi\) over \(Y_i\). In any event, the resultant
algebraic system is usually a banded system. Moreover, in Lin (1986a) it was
shown that for every BVP it is possible to find a numerical scheme that is
accurate to any order while still keeping the FD approximation spread over
only a three-grid-point stencil. The idea behind this was explained before,
and it relies on the recursive usage of equations (7) and (8) as given by
equation (9). Using this approach, let us spread \(m_i\) grid points over
\(Y_i\) and construct the tridiagonal linear system for the solution
\(\Phi\) that is governed by eq. (5) [or eq. (1)]. Now we can easily find
the vector \(C_i\) at any given internal grid point. According to Lemma 3,
the time \(T_i\) needed for the i-th processor to finish the execution of
Step 2 is independent of the location of the point \(x_i\) inside the
\(m_i\) points: \(T_i = E m_i\). For the last step of this algorithm we have:


LEMMA: 4 The time R needed to execute Step 3 is

\[ R = FP \tag{17} \]

where

\[ F = 4M + 3AS + 2D \tag{18} \]


The total time needed by the algorithm is defined by
\(T_{total} = \max_i(T_i) + R\). This approach is somewhat similar to that
of Kowalik et al (1985), where a parallel algorithm for solving a
tridiagonal system was presented. In general we should mention the SIMD
partition algorithm for tridiagonal systems of Wang (1981) and the
partitioning algorithm for banded systems of Dongarra et al (1984). In the
present case, when all the PEs are identical, it can easily be shown that
for all the processors to finish Step 2 at the same time, all the \(m_i\)
should be equal to one another, while usually the error is determined by
the total number of points m in \(\Omega\):

\[ m = \sum_{i=1}^{P} m_i \tag{19} \]

The speedup is defined as the ratio between the time needed to solve the
problem on the serial machine and the total time that is needed to solve this
problem on a parallel machine using the suggested algorithm. Sometimes
this definition is confusing, not only because the algorithms for the serial
machine and for the parallel one are not the same, but also because of the
definition of the term “the same problem”. For example, in our case, finding
the approximated values of \(\Phi\) at the set W of the key points (see
Step 1 in the algorithm) is a different problem for the parallel machine
than the problem of finding all the m approximated values of \(\Phi\), while
for the serial machine it will be the same problem. The efficiency of the
algorithm, \(\eta\), is defined as the ratio between the speedup and the
number of processors P. The following property of \(\eta\) applies in our
case:


LEMMA: 5 If only the P values of \(\Phi\) are required for the final
solution, then \(\eta\) = [expression illegible in the source]; otherwise
\(\eta\) = [expression illegible in the source].
This lemma is proven simply by substituting the expressions for T and
R. An interesting case, though not so important, is when the total number of
grid points in \(\Omega\) is fixed and we have the flexibility to choose the
number of processors P. In this case the following lemma applies:

LEMMA: 6 The optimum number of processors for a given m is
\(P = K\sqrt{m}\), where K is a constant which depends on the machine
features.

This result can be verified by minimizing the total time with respect to
P. The efficiency in this case is \(\eta\) = [expression illegible in the
source], which is \(\approx\) [value illegible in the source] for most of the
machines [see also Kowalik et al (1985)]. The sensitivity of the total time
to the number of processors in this case is
\(\Delta T_{total} = \frac{F}{P_{optimal}} (\Delta P)^2\). It can
be seen that as the optimal number of processors increases the total time
will not increase as much.
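
The square-root law of Lemma 6 and the quadratic sensitivity can be
recovered from a generic cost model. The sketch below assumes, as a
simplification not stated in the source, that the local time has the form
A/P with A proportional to Em (every processor holding an equal share of the
m grid points), and that the global time is FP as in Lemma 4:

\[
T_{total}(P) = \frac{A}{P} + FP , \qquad
\frac{dT_{total}}{dP} = -\frac{A}{P^2} + F = 0
\;\Longrightarrow\; P_{optimal} = \sqrt{A/F} \propto \sqrt{m}
\]

\[
\Delta T_{total} \approx \frac{1}{2} \left. \frac{d^2 T_{total}}{dP^2} \right|_{P_{optimal}} (\Delta P)^2
= \frac{A}{P_{optimal}^3} (\Delta P)^2
= \frac{F}{P_{optimal}} (\Delta P)^2
\]

The last equality uses \(A = F P_{optimal}^2\). Under this model the penalty
for missing the optimum by a fixed \(\Delta P\) shrinks as the optimal
number of processors grows.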

5 The parallel Algorithm PARA2. 

The second parallel algorithm for solving eq. (5) numerically that will be
considered here is similar to the first one, PARA1, in the general sense,
but its local and global steps are much more closely tied to each other
than in the first one. It consists of the following four steps:

Step 1: Similar to the first step of PARA1, choose a set W of P internal
discrete points in \(\Omega\) which are the key points:
\(W = \{x_0, x_1, \ldots, x_P, x_{P+1}\}\), with the understanding that
\(x_0 = L\) and \(x_{P+1} = R\). Given a positive real number h, add an
additional P + 1 points \(Z = \{z_i\}\) so that \(x_i - z_i = h\). For
i = 1, 2, ..., P + 1 define the subdomain \(\Omega_i\) as
\(\Omega_i = [z_i, x_{i+1}]\). \(Y_i\) is now a vector that has two vector
components: \(Y_i^{(1)} = (z_i, x_i, z_{i+1})^T\) and
\(Y_i^{(2)} = (x_i, z_{i+1}, x_{i+1})^T\).

Step 2: Solve in a parallel manner P systems of the type
\(P1(g, Y_i^{(k)}; C_i^{(k)})\); k = 1, 2; i = 1, 2, ..., P, where the i-th
processor solves the i-th system of the two problems independently of the
others.

Step 3: Each processor i, i = 1, ..., P - 1, sends its solution vector
\(C_i^{(2)}\) to the processor i + 1. Then each processor i substitutes the
\(C_{i-1}^{(2)}\) and \(C_i^{(2)}\) results into the \(C_i^{(1)}\) vector
result for \(x_i\), eliminating the contributions of \(z_i\) and
\(z_{i+1}\). The new vectors \(C_i^{(1)}\) will be denoted by \(C_i\).

Step 4: Solve the following tridiagonal linear system for the set
\(\phi = \{\phi_i\}_{i=1}^{P}\):

\[ C_i^T \phi_i = 0 , \qquad i = 1, 2, \ldots, P \tag{20} \]

using one of the processors.

The way step 2 is executed is very similar to that of step 2 of the previous
algorithm; here each processor i spreads an FD grid over the
\([x_i, z_{i+1}]\) domain with, say, \(n_i\) grid points and executes the
program Solve_P1 once with \(k = n_i - 1\) to solve for \(C^{(1)}\) and once
with k = 2 to solve for \(C^{(2)}\). The main difference between the two
algorithms is that in the PARA1 algorithm the domain \([x_i, x_{i+1}]\) is
solved twice during Step 2 by two different processors, while in the PARA2
algorithm this domain is solved twice by the same processor. It can be shown
that the two algorithms have similar computational complexities and share
the lemmas that have been discussed in the previous section. In some sense
the PARA1 algorithm is similar to the one that appears in Kowalik et al
(1985) and the PARA2 algorithm is similar to that of Dongarra et al (1984).
However, the effectiveness of PARA2 can be observed when only P/2
subdomains are considered and two processors are attached to each
subdomain. Each of the two processors solves one of the two problems that
appear in step 2. Although step 3 is executed with (P + 1)/2 processors,
the efficiency is a little better than that of PARA1. In this result we did
not take into consideration that the remaining (idle) processors in step 3
can still help on a different (finer) parallel grain level.


6 Computational Tests and Analysis. 

We have tested the parallel numerical algorithms intensively for different
BVPs. To illustrate such a test and to demonstrate the potential of these
algorithms to solve BVPs numerically in a parallel manner much faster and
more accurately than in the serial mode, we considered the following two
point nonlinear boundary value problem:

\[ \mathcal{L}(\Phi) = \Phi'' - \tfrac{1}{2} \left[ e^{2\Phi} + (\Phi')^2 \right] = 0 \tag{21} \]

with \(\Phi(0) = 0\) and \(\Phi(1) = -\ln 2\). It has the exact solution
\(\Phi(x) = -\ln(1 + x)\). Using the quasi-linear approach mentioned before,
this equation is solved iteratively. Denoting the iteration number by a
superscript, and the difference of two successive solutions by z:

\[ z = \Phi^{(j+1)} - \Phi^{(j)} \tag{22} \]

then the linear equation that is solved in the (j + 1)-th iteration is:

\[ z'' - e^{2\Phi^{(j)}} z - \Phi^{(j)\prime} z' + \mathcal{L}(\Phi^{(j)}) = 0 \tag{23} \]
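
The quasi-linear iteration for this test problem can be reproduced serially
in a few lines. The following Python sketch is an illustration only and
makes several assumptions that are not taken from the source: second-order
central differences on a uniform grid of N cells (the tests reported here
use fourth order schemes), a plain Thomas solver for the linear correction
equation, and a stopping tolerance of 1e-10 that is looser than the one
quoted in the text so that rounding noise does not prevent termination. The
function names are hypothetical.

```python
from math import exp, log

def thomas(lo, di, up, rhs):
    """Solve a tridiagonal system; lo[0] and up[-1] are ignored."""
    n = len(di)
    di, rhs = di[:], rhs[:]
    for i in range(1, n):
        w = lo[i] / di[i - 1]
        di[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    x = [0.0] * n
    x[-1] = rhs[-1] / di[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - up[i] * x[i + 1]) / di[i]
    return x

def newton_bvp(N=64, tol=1e-10, max_iter=20):
    """Quasi-linearization for Phi'' = (exp(2 Phi) + Phi'^2)/2, Phi(0) = 0, Phi(1) = -ln 2."""
    h = 1.0 / N
    x = [j * h for j in range(N + 1)]
    phi = [-log(2.0) * xj for xj in x]           # linear first guess matching both boundary values
    for it in range(1, max_iter + 1):
        lo, di, up, rhs = [], [], [], []
        for j in range(1, N):
            p = (phi[j + 1] - phi[j - 1]) / (2 * h)              # Phi' at x_j
            q = exp(2 * phi[j])
            res = (phi[j - 1] - 2 * phi[j] + phi[j + 1]) / h**2 - 0.5 * (q + p * p)
            # Discrete form of z'' - p z' - q z = -L(Phi), with z = 0 at both ends
            lo.append(1 / h**2 + p / (2 * h))
            di.append(-2 / h**2 - q)
            up.append(1 / h**2 - p / (2 * h))
            rhs.append(-res)
        z = thomas(lo, di, up, rhs)
        for j in range(1, N):
            phi[j] += z[j - 1]
        if max(abs(v) for v in z) < tol:
            return x, phi, it
    return x, phi, max_iter
```

Calling `newton_bvp()` returns the grid, the converged values and the number
of iterations used; the values can be compared with \(-\ln(1 + x)\) to see
the discretization error of the second-order stencil.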
Table 1: The efficiency of the parallel algorithms as a function of the
iterations.
We have used two parallel environments: the INTEL-Hypercube and the
Alliant-FX/8. Although the results for the interconnection time are not as
accurate, the measurements of the CPU time in these models are quite good.
In the present tests we have considered methods that use fourth order
schemes. Three types of schemes to solve the P1 problem were considered:
(1) the adaptive high order scheme of Pereyra et al (1979), (2) the shooting
scheme of De Boor et al (1983) using the fourth order Runge-Kutta method
and (3) the high order three point scheme of Lin (1986b). Out of the P
available processors, processors number 1, 4, 7, ... use scheme 1,
processors 2, 5, 8, ... use scheme 2 and the rest use scheme 3 (on the
Alliant we ran only two of the schemes at the same time). Table 1 summarizes
the main results. The first guess for \(\Phi\) was a linear function and the
iterations were stopped when \(\|z\|_\infty < 10^{-16}\). It took about 6
iterations for both algorithms to converge. The variation in the efficiency
with P is due to the different accuracy demands when computing the
nonlinear functions. Unlike other parallel algorithms [Dongarra et al
(1984)], the algorithms used here do not require many more arithmetic
operations than the appropriate sequential algorithm, as is stated in the
following lemma:


LEMMA: 7 The redundancy of the PARA algorithms is O(P). 


This lemma, which is simple to prove, means that the present parallel
algorithms require a total number of operations that is greater by a
constant times P than the number for the serial machine. This result is
not bad when compared to other parallel algorithms. The communication
complexity is of the order of O(MAXI × P), where MAXI is the maximum
number of iterations (not counting the scattering of data to the processors,
if needed, at the beginning of the algorithm). Another measure, which
involves both the computational cost and the communication cost, is the
ratio \(\mu\) of the computational time needed for the local step to that
needed for the global step. Usually, as \(\mu\) increases the algorithm is
better in the above sense. For the present algorithms it can be proven that
\(\mu = \frac{E m_i}{F P}\), and as the accuracy demands increase (and thus
also \(m_i\)), \(\mu\) increases for a given machine.